In this project, we focus on studying the air crash investigation dataset from National Transportation Safety Board (NTSB). We want to study the question: How well can flight parameters (engine type/number, weather condition, the broad phase of flight, etc.) be used to predict the fatality of a flight?
The dataset we used contained information on air crashes investigated by the NTSB from 1982 to 2007. The investigation range of NTSB includes all air crashes happened within U.S. territory, or when the involved aircraft are U.S. made, or the aircraft carries passengers/crew members that are U.S. citizens, or otherwise invited by the transportation safety bureau from another country. Commercial airline crashes as well as other aviation crashes are included in this dataset.
##Setup
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(broom)
filename <- "./data/AviationData.txt"
df_aviation <- read.delim2(filename, header = TRUE, sep = "|", dec = ".")
df_aviation
df_aviation %>% glimpse()
## Rows: 63,913
## Columns: 32
## $ Event.Id <fct> 20080125X00106 , 20080206X00141 , 20080109X000…
## $ Investigation.Type <fct> Accident , Accident , Accident , Accident …
## $ Accident.Number <fct> SEA08CA056 , CHI08WA075 , NYC08FA071 , LAX…
## $ Event.Date <fct> 12/31/2007 , 12/31/2007 , 12/30/2007 , 12/…
## $ Location <fct> " Santa Ana, CA ", " Guernsey, United Kingdom …
## $ Country <fct> United States , United Kingdom , United Sta…
## $ Latitude <dbl> 33.67556, 49.43500, 34.68861, 35.54222, 45.866…
## $ Longitude <dbl> -117.868056, -2.600278, -87.920000, -120.52277…
## $ Airport.Code <fct> SNA , , , PRB , AXN , , , , , …
## $ Airport.Name <fct> John Wayne - Orange County , , , Paso Ro…
## $ Injury.Severity <fct> Non-Fatal , Non-Fatal , Fatal(3) , Fatal(1…
## $ Aircraft.Damage <fct> Substantial , Minor , Substantial , Substa…
## $ Aircraft.Category <fct> Airplane , , Helicopter , Airplane , Air…
## $ Registration.Number <fct> N2800D , , N109AE , N254SR , N5093F , …
## $ Make <fct> Piper , Cessna , BELL , Cirrus Design Corp…
## $ Model <fct> PA-12 , T303 , 206L-3 , SR22 , RV-8 , R-…
## $ Amateur.Built <fct> No , No , No , No , Yes , No , Yes , N…
## $ Number.of.Engines <dbl> 1, NA, 1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, NA, N…
## $ Engine.Type <fct> Reciprocating , , Turbo Shaft , Reciproca…
## $ FAR.Description <fct> Part 91: General Aviation , , Part 91: Gen…
## $ Schedule <fct> , , , , , , , NSCH , , , …
## $ Purpose.of.Flight <fct> Instructional , Unknown , Other Work Use , …
## $ Air.Carrier <fct> " ", " ", " ", " ", " ", " ", " ", " "…
## $ Total.Fatal.Injuries <dbl> NA, NA, 3, 1, NA, 1, 2, 1, NA, NA, NA, 0, NA, …
## $ Total.Serious.Injuries <dbl> NA, NA, 0, NA, NA, 1, 0, 1, NA, NA, NA, 1, NA,…
## $ Total.Minor.Injuries <dbl> NA, NA, 0, NA, NA, NA, 0, 2, NA, NA, NA, 0, NA…
## $ Total.Uninjured <dbl> 2, 1, 0, NA, 1, NA, 0, NA, 5, 1, 2, 0, 1, 2, 2…
## $ Weather.Condition <fct> VMC , , VMC , VMC , VMC , VMC , VMC , …
## $ Broad.Phase.of.Flight <fct> LANDING , , MANEUVERING , MANEUVERING , …
## $ Report.Status <fct> Probable Cause , Foreign , Probable Cause ,…
## $ Publication.Date <fct> 02/28/2008 , 02/06/2008 , 01/15/2009 , 06/…
## $ X <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
Observations: A glimpse at the dataset shows that 63,913 accidents and incidents are recorded in this dataset. 32 columns are presented, in which the total number of fatal, serious, and minor injuries, as well as the total uninjured number are recorded as output variables, and other variables including Make, Model, Number.of.Engines, Engine.Type, Purpose.of.Flight and Broad.Phase.of.Flight are interpreted as input variables in this study.
df_binomial <- df_aviation %>% mutate(Fatal = as.logical(Total.Fatal.Injuries))
df_binomial %>% count(Fatal)
Observations: A simple check on the Total.Fatal.Injuries column of the dataset reveals that from 1982 to 2007, there are 39930 non-fatal crashes and 12582 fatal crashes, and 11401 flight crash information are not available.
df_binomial %>%
filter(is.na(Fatal) == FALSE) %>%
group_by(Weather.Condition) %>%
count(Fatal) %>%
ggplot(aes(x = Weather.Condition, y = n, fill = Fatal)) +
geom_col(position = "dodge") +
ggtitle("Number of Fatal and Non-fatal Crashes in Different Weather Conditions")
Observations: VMC, or visual meteorological conditions, means that pilots have sufficient visibility to fly the aircraft maintaining visual separation from terrain and other aircraft. IMC, or instrumental meteorological conditions, means that pilots need to fly primarily by reference to instruments, and therefore under instrument flight rules (IFR). Typically, IMC means flying in cloudy or bad weather, while VMC means fine weather.
Comparing the absolute value of fatal crashes in VMC is higher than IMC. However, this may be due to the fact that much more flights are flown in VMC, as flights are likely to be canceled or diverted in deteriorating weather. Comparing the ratio of fatal and non-fatal crashes in VMC and IMC, it turns out that in IMC the crashes have a higher odds of being fatal.
The contribution of weather condition to a fatal crash will be investigated in later sections by fitting it into a logistic model.